Abstract

Financial market investors are involved in many games: they must interact with
other agents to achieve their goals. Some of these games are directly connected
with their activity on markets, but one cannot neglect other aspects that
influence human decisions and investors' performance. Distinguishing all
subgames is usually beyond hope and resource consuming. In this paper we study
how investors facing many different games gather information and form their
decisions despite being unaware of the complete structure of the game. To this
end we apply reinforcement learning methods to the Information Theory Model of
Markets (ITMM). Following Mengel, we can try to distinguish a class $\Gamma$ of
games and possible actions (strategies) $a^i$ for the $i$-th agent. Any agent
divides the whole class of games into subclasses she/he thinks are analogous
and therefore adopts the same strategy for a given subclass. The criteria for
partitioning are based on profit and cost analysis. The analogy classes and
strategies are updated at various stages through the process of learning. We
will study the asymptotic behaviour of the process and attempt to identify its
crucial stages, e.g. the existence of possible fixed points or optimal
strategies. Although we focus more on the instrumental aspects of agents'
behaviour, various algorithms can be put forward and used for automatic
investment. This line of research can be continued in various directions.

Key words: econophysics, market games, learning 
PACS: 01.75.-Fm, 02.50.Ga, 02.70.-c, 42.30.Sy 



Preprint submitted to Elsevier 



4 December 2008 



Motto: 



"The central problem for gamblers is to find positive expectation bets. But the 
gambler also needs to know how to manage his money, i.e. how much to bet. 
In the stock market (more inclusively, the securities markets) the problem is 
similar but more complex. The gambler, who is now an investor, looks for
excess risk adjusted return."

Edward O. Thorp 



1 Introduction 



Noise or structure? We face this question almost always while analyzing large
data sets. Pattern discovery is one of the primary concerns in various fields of
research, commerce and industry. Models of optimal behaviour often belong
to that class of problems. The goal of an agent in a dynamic environment is to
make optimal decisions over time. One usually has to discard a vast amount
of data (information) to obtain a concise model or algorithm. Therefore the
prediction of individual agents' behaviour is often burdened with large errors.
The prediction game algorithm can be described as follows.

FOR n = 1, 2, ...
Reality announces $x_n \in X$
Predictor announces $\gamma_n \in \Gamma$
Reality announces $y_n \in Y$
END FOR,

where $x_n \in X$ is the data upon which the prediction $\gamma_n \in \Gamma$ is
made at each round $n$. The prediction quality is measured by some utility
function $\upsilon : \Gamma \times Y \to \mathbb{R}$. One can view such a process
as a communication channel that transmits information from the past to the
future [1]. The gathering of
information, often indirect and incomplete, is referred to as measurements. 
Learning theory deals with the abilities and limitations of algorithms that 
learn or estimate functions from data. Learning helps with optimal behaviour 
decisions by adjusting agent's strategies to information gathered over time. 
Agents can base their action choices on predictions of the state of the
environment or on rewards received during the process. For example, a Markov
decision process can be formulated as a problem of finding a strategy $\pi$ that
maximizes the expected sum of discounted rewards:

\[ v(s, \pi) = r(s, a_\pi) + \beta \sum_{s'} p(s'|s, a_\pi)\, v(s', \pi), \]

where $s$ is the initial state, $a_\pi$ is the action induced by the strategy
$\pi$, $r$ is the reward at stage $t$ and $\beta$ is the discount factor; $v$ is
called the value function. Let $p(s'|s, a_\pi)$ denote the (conditional)
probability^1 of reaching the state $s'$ from the state $s$ as a result of
action $a_\pi$. It can be shown that, in the case of an infinite horizon, an
optimal strategy $\pi^*$ such that (Bellman optimality equation)

\[ v(s, \pi^*) = \max_a \Big\{ r(s, a) + \beta \sum_{s'} p(s'|s, a)\, v(s', \pi^*) \Big\} \]

exists. In reinforcement learning, the agent receives rewards from the
environment and uses them as feedback for its actions. Reinforcement learning
has its roots in statistics, cybernetics, psychology, neuroscience, computer
science, etc. In its standard formulation, the agent must improve his/her
performance in a game through trial-and-error interaction with a dynamical
environment. There are two ways of finding the optimal strategy:

• strategy iteration - one directly manipulates the strategy;
• value iteration - one approximates the optimal value function.

Therefore two classes of algorithms are considered: strategy (policy) iteration
algorithms and value iteration algorithms. In the following section we discuss
the adequacy of reinforcement learning in market games.
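As an illustration of the value iteration approach mentioned above, the
following is a minimal sketch that repeatedly applies the right-hand side of the
Bellman optimality equation until the value function stops changing. The
two-state, two-action transition table `P`, the rewards `R` and the discount
factor `BETA` are hypothetical numbers chosen only for this example; they do not
come from any market model.

```python
# Toy MDP (hypothetical numbers): 2 states, 2 actions.
# P[s][a][s2] = p(s2 | s, a), R[s][a] = r(s, a).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]
BETA = 0.9  # discount factor (assumed)


def backup(v, s, a):
    """r(s, a) + beta * sum_{s'} p(s'|s, a) v(s') for a fixed action a."""
    return R[s][a] + BETA * sum(P[s][a][s2] * v[s2] for s2 in range(len(v)))


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate v(s) <- max_a backup(v, s, a) until the change is below tol."""
    n_states = len(R)
    v = [0.0] * n_states
    for _ in range(max_iter):
        new_v = [
            max(backup(v, s, a) for a in range(len(R[s])))
            for s in range(n_states)
        ]
        converged = max(abs(x - y) for x, y in zip(new_v, v)) < tol
        v = new_v
        if converged:
            break
    # Greedy strategy with respect to the (approximately) optimal values.
    policy = [
        max(range(len(R[s])), key=lambda a: backup(v, s, a))
        for s in range(n_states)
    ]
    return v, policy


v_star, pi_star = value_iteration()
print(v_star, pi_star)
```

Because the update is a contraction with factor `BETA`, the loop converges
geometrically; strategy iteration would instead alternate between evaluating a
fixed strategy and improving it greedily.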



2 Reinforcement learning in market games 



Can reinforcement learning help with the analysis of market games? Could it be
used for finding optimal strategies? It is not easy to answer these questions
because they involve the problem of real-time decision making: one often has to
(re-)act as quickly as possible. Consider model-free reinforcement learning,
Q-learning^2 [3]. In this approach one defines the value of an action $Q(s, a)$
as the discounted return obtained if action $a$ following from the strategy
$\pi$ is applied:

\[ Q^*(s, a) = r(s, a) + \beta \sum_{s'} p(s'|s, a)\, v(s', \pi^*), \]



^1 In a more formal setting it would be a transition kernel for the process of
consecutive actions and observations.

^2 This is obviously a value iteration, but in market games there is a natural
value function - the profit.
then

\[ v(s, \pi^*) = \max_a Q^*(s, a). \]

In Q-learning, the agent starts with arbitrary $Q(s, a)$ and at each stage $t$
observes the reward $r_t$ and the next state $s'$, and then updates the value of
$Q$ according to the rule:

\[ Q_{t+1}(s, a) = (1 - \alpha_t)\, Q_t(s, a) + \alpha_t \big( r_t + \beta \max_b Q_t(s', b) \big), \]

where $\alpha_t \in [0, 1)$ is the learning rate that needs to decay over time
for the learning algorithm to converge. This approach is frequently used in the
stochastic games setting. Watkins and Dayan proved that this sequence converges
provided all states and actions have been visited/performed infinitely often
[5]. Therefore we anticipate slow convergence rates. Indeed, various theoretical
and experimental analyses [6,7,8] showed that even very simple games might
require $\sim 10^{\ldots}$ steps! If a well-shaped stock trend is formed, one can
expect that there are some sorts of adversarial equilibria (no agent is hurt by
any change of others' strategies)

\[ R_i(\pi_1, \ldots, \pi_n) \le R_i(\pi'_1, \ldots, \pi'_{i-1}, \pi_i, \pi'_{i+1}, \ldots, \pi'_n) \]

or coordination equilibria (all agents achieve their highest possible return) 

\[ R_i(\pi_1, \ldots, \pi_n) = \max_{a_1, \ldots, a_n} R_i(a_1, \ldots, a_n). \]

Here the $R_i$ denote the pay-off functions and the $\pi_i$ the one-stage
strategies. The problem is that such equilibria can be easily identified with
technical analysis^3 tools, so there is no need to resort to learning
algorithms. In the most interesting classes of games neither adversarial
equilibria nor coordination equilibria exist. This type of learning is much more
subtle and, up to now, there is no satisfactory analysis in the field of
reinforcement learning. Therefore a compromise is needed; for example, we must
be willing to accept returns that might not be optimal. The models discussed in
the following subsections belong to that class and seem to be tractable by
learning algorithms.
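The Q-learning update rule described in this section can be sketched as follows.
This is only an illustration under stated assumptions: the two-state environment
(action 1 always pays 1, action 0 pays 0, the next state is drawn uniformly),
the purely random exploration and the learning rate schedule
$\alpha_t = (n+1)^{-0.7}$ are all hypothetical choices, not part of the model
discussed in the text.

```python
import random


def q_update(q, s, a, reward, s_next, alpha, beta):
    """One Q-learning step: blend the old estimate with the new target."""
    target = reward + beta * max(q[s_next])
    q[s][a] = (1.0 - alpha) * q[s][a] + alpha * target
    return q[s][a]


def run_q_learning(n_steps=20_000, beta=0.5, seed=0):
    """Learn Q for a toy 2-state, 2-action environment (hypothetical)."""
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(n_actions)       # pure exploration: all pairs get visited
        reward = 1.0 if a == 1 else 0.0    # toy reward: action 1 always pays 1
        s_next = rng.randrange(n_states)   # toy dynamics: next state is uniform
        visits[s][a] += 1
        alpha = (visits[s][a] + 1) ** -0.7  # decaying learning rate in (0, 1)
        q_update(q, s, a, reward, s_next, alpha, beta)
        s = s_next
    return q


if __name__ == "__main__":
    for state, row in enumerate(run_q_learning()):
        print(state, row)
```

In this toy setting the optimal values are $Q^*(s, 1) = 2$ and $Q^*(s, 0) = 1$
for $\beta = 0.5$, so the learned table can be checked by hand. In a market game
the reward would be the realised profit, and exploring every state-action pair
often enough is exactly what makes convergence slow.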

2.1 Kelly's criterion 

Kelly's criterion [9] can be successfully applied in horse betting or blackjack
when one can discern biases [10], even though its optimality and convergence
can be proven only in the asymptotic case. The simplest form of Kelly's
formula is:

\[ e = W - \frac{1 - W}{R}, \]

where: 

^3 We understand the term technical analysis as simplified hypothesis testing
methods that can be applied in real time.
• e = percentage of capital to be put into a single trade.

• W = historical winning percentage of a trading system. 

• R = historical average win/loss ratio. 

Originally, Kelly's formula involves finding the "bias ratio" in a biased game.
If the game is repeated infinitely often, then one should put at stake the
percentage of one's capital equal to the bias ratio. Therefore one can easily
construct various learning algorithms that perform the task of finding an
environment in which Kelly's approach can be effectively applied (bias search +
horizon of the investment) [11,12].
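The simple form of Kelly's formula, W - (1 - W)/R, is easy to evaluate from a
trading record. The sketch below is illustrative only: the function names, the
convention of flooring a negative result at zero (interpreted as "do not
trade") and the way W and R are estimated from a list of past profits and
losses are assumptions made for this example.

```python
def kelly_fraction(win_rate, win_loss_ratio):
    """Fraction of capital to stake: W - (1 - W) / R, floored at zero."""
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must lie in [0, 1]")
    if win_loss_ratio <= 0.0:
        raise ValueError("win_loss_ratio must be positive")
    fraction = win_rate - (1.0 - win_rate) / win_loss_ratio
    return max(fraction, 0.0)  # assumption: a negative value means no bet


def kelly_from_trades(pnl):
    """Estimate W and R from historical profits/losses, then apply the formula."""
    wins = [x for x in pnl if x > 0]
    losses = [-x for x in pnl if x < 0]
    if not wins or not losses:
        raise ValueError("need at least one winning and one losing trade")
    win_rate = len(wins) / len(pnl)                # historical winning percentage W
    avg_win = sum(wins) / len(wins)
    avg_loss = sum(losses) / len(losses)
    return kelly_fraction(win_rate, avg_win / avg_loss)  # R = avg win / avg loss


if __name__ == "__main__":
    print(kelly_fraction(0.55, 1.5))
    print(kelly_from_trades([2.0, -1.0, 2.0, -1.0]))
```

For W = 0.55 and R = 1.5 the formula gives 0.55 - 0.45/1.5 = 0.25, i.e. a
quarter of the capital per trade. A learning algorithm of the kind mentioned
above would re-estimate W and R as new trades arrive.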

2.2 MMM model 

Piotrowski and Sladkowski have analysed the model where the trader fixes a
maximal price he is willing to pay for the asset and then, if the asset is
bought, after some time sells it at random [13]. One can easily reverse the
buying and selling strategies. The expected value of the profit after the
whole cycle is

\[ \rho(a) = \frac{-\int_{-\infty}^{-a} p\, \eta(p)\, dp}{1 + \int_{-\infty}^{-a} \eta(p)\, dp}, \]

where $a$ is the withdrawal price and $\eta$ is the probability distribution of
$p$. The maximal value of the function $\rho$, $a_{max}$, lies at a fixed point
of $\rho$, that is, it fulfils the condition $\rho(a_{max}) = a_{max}$. The
simplest version of the strategy is as follows: there is an optimal strategy
that fixes the withdrawal price at the level of the historical average
profit^4. Task: find an implementation of a reinforcement learning algorithm
that can be used effectively on markets. We should control both the probability
distribution $\eta$ and the profit "quality".
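The fixed-point property of the profit function suggests a very simple
iterative scheme: set the withdrawal price to the profit it currently yields and
repeat. The sketch below assumes that the profit has the form
$\rho(a) = -\int_{-\infty}^{-a} p\,\eta(p)\,dp \,/\, (1 + \int_{-\infty}^{-a} \eta(p)\,dp)$
and, purely for illustration, that $\eta$ is a standard normal density; both
the closed-form expressions used in `profit` and the function names are
assumptions of this example.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def profit(a):
    """rho(a) for a standard normal eta (assumed for illustration).

    Numerator:   -int_{-inf}^{-a} p eta(p) dp = normal_pdf(a)
    Denominator: 1 + int_{-inf}^{-a} eta(p) dp = 1 + normal_cdf(-a)
    """
    return normal_pdf(a) / (1.0 + normal_cdf(-a))


def fixed_point(a0=0.0, tol=1e-12, max_iter=1000):
    """Iterate a <- rho(a) until successive values differ by less than tol."""
    a = a0
    for _ in range(max_iter):
        new_a = profit(a)
        if abs(new_a - a) < tol:
            return new_a
        a = new_a
    return a


if __name__ == "__main__":
    a_max = fixed_point()
    print(a_max, profit(a_max))
```

Under these assumptions the iteration settles near a = 0.276, where the profit
equals the withdrawal price. In a learning setting the agent would not know
$\eta$ and would have to replace `profit` by an estimate built from the
historical average profit.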

2.3 Learning across games 

An interesting approach was put forward by Mengel [2]. One can easily give
examples of situations where agents cannot be sure in what game they are 
taking part (e.g. the games may have the same set of actions). Distinguishing 
all possible games and learning separately for all of them requires a large 
amount of alertness, time and resources (costs). Therefore the agent should 
try to identify some classes of situations she/he perceives as analogous and 
therefore takes the same actions. The learning algorithm should update both 
the partition of the set of games and actions to be taken: 

^4 Or else: do not try to outperform yourself.
• Agents repeatedly play a game (randomly) drawn from a set $\Gamma$.

• Agents partition the set of all games into subsets (classes) of games they
learn not to discriminate (see them as analogous).

• Agents update both propensities to use partitions {G} and attractions
towards using their possible strategies/actions.
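The three steps listed above can be sketched with a simple cumulative
reinforcement scheme. This is a deliberately simplified illustration and not
the model of Ref. [2]: it uses a single agent playing against nature, two
hypothetical games with made-up pay-offs, two candidate partitions, an assumed
fixed cost per analogy class, and choice probabilities proportional to the
accumulated propensities and attractions.

```python
import random

# Hypothetical toy setting: PAYOFF[game][action].
PAYOFF = {0: [1.0, 0.2], 1: [0.2, 1.0]}
PARTITIONS = {
    "coarse": [frozenset({0, 1})],            # both games seen as analogous
    "fine": [frozenset({0}), frozenset({1})],  # games distinguished
}
COST_PER_CLASS = 0.05  # assumed cost of maintaining one analogy class


def weighted_choice(rng, weights):
    """Pick a key of `weights` with probability proportional to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def learn_across_games(n_rounds=5000, seed=0):
    rng = random.Random(seed)
    propensity = {name: 1.0 for name in PARTITIONS}
    attraction = {
        (name, cls): {0: 1.0, 1: 1.0}
        for name, classes in PARTITIONS.items()
        for cls in classes
    }
    total_reward = 0.0
    for _ in range(n_rounds):
        game = rng.choice(list(PAYOFF))              # nature draws a game
        name = weighted_choice(rng, propensity)      # agent picks a partition
        cls = next(c for c in PARTITIONS[name] if game in c)
        action = weighted_choice(rng, attraction[(name, cls)])
        # Net pay-off stays positive here, so all weights remain valid.
        reward = PAYOFF[game][action] - COST_PER_CLASS * len(PARTITIONS[name])
        propensity[name] += reward                   # reinforce the partition
        attraction[(name, cls)][action] += reward    # reinforce the action
        total_reward += reward
    return propensity, attraction, total_reward


if __name__ == "__main__":
    prop, attr, total = learn_across_games()
    print(prop)
    print(total)
```

The cost term is what makes the trade-off non-trivial: a finer partition allows
better-adapted actions but is more expensive to maintain, so which partition
accumulates more propensity depends on the pay-offs and on `COST_PER_CLASS`.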

The asymptotic behaviour and computational complexity of such a process are
discussed in Ref. [2]. Stochastic approximation works in this case
(approximation through a system of deterministic differential equations is
possible). It would be interesting to analyse the following problems.
Problem 1: Identify possible "classes of market games". Problem 2: Identify a
"universal" set of strategies. For example, on the stock exchange one can try
the brute force approach: admit as strategies buying/selling at all possible
price levels and identify classes of games with trends. Unfortunately, the
number of approximations generates huge transaction costs. These can be reduced
on the derivative markets because, due to the leverage, the ratio of
transaction costs to price movements is much lower. We envisage that an agent
may try to optimize among various classes of technical analysis tools.



3 Conclusion 

As conclusions we would like to stress the following points. 

• Algorithms are simple but computation is complex, time and resource
consuming.

• Learning across games could be used to "fit" technical analysis models.

• Dynamic proportional investing (Kelly) should be the easiest to implement.
But here we envisage problems analogous to those related to heat (entropy) in
thermodynamics, and exploration of knowledge might, in cases of high
effectiveness, involve paradoxes [11] analogous to those arising when speed
approaches the speed of light [12].

• One can envisage learning (information) models of markets/portfolio theory.

• Implementation should be carefully tested - transaction costs can "kill" even
crafty algorithms [15].

• Quantum algorithms/computers, if they ever come true, might change the
situation in a dramatic way: we would have powerful algorithms at our disposal
and the learning limits would certainly broaden [16,17,18].